Entering my freshman year of college, I was around 230 pounds standing at 6 feet tall. I had kept myself somewhat in shape through the constant exercising and training that came with the very structured nature of high school sports, but with all the freedom of college and no real ambition to keep up a workout routine, I became fairly out of shape. My best friends from high school were very active and in the gym daily. Coming home for winter break and seeing them again, I decided I wanted to start my fitness journey with them. With COVID-19 sending everyone home from school, this was a great opportunity to work out with my friends and bring a consistent pattern to my days. After two years of real dedication, I had lost around 50 pounds and was stronger than ever before! Developing a passion for lifting turned my gaze to powerlifting meets. These events let people demonstrate their hard work, compete against others, and come together to share their fitness journeys. Moreover, I was not alone in this. USA Powerlifting (https://www.washingtonpost.com/arts-entertainment/2019/04/19/using-youtube-show-off-world-powerlifting/) reports that competition membership has more than tripled since 2014, from 6,410 members to 22,026. In 2018, the federation listed a 2:1 ratio of men to women registered for competitions, compared to 4:1 four years earlier. So not only has powerlifting seen a surge in participation over the last few years, but also greater equality among genders! With a love for computer science, math, and fitness, choosing a topic for a data science project was not that difficult.
The main purpose of this project is to walk through the entire data science pipeline. The amazing team at Open Powerlifting (https://www.openpowerlifting.org/) has collected and organized data from millions of powerlifting meet entries, which makes it a great starting point for this project! The data set can be downloaded here: https://openpowerlifting.gitlab.io/opl-csv/bulk-csv.html. From this data, I hope to find interesting trends and correlations that can inform everyone from novice powerlifters to pioneers in the field.
Why is this important?
As discussed above, powerlifting has expanded rapidly over the last few years. This increase in participation has led to growing cash prizes for winners, a much more competitive scene, and overall more people interested in learning about the sport. Providing insight from powerlifting meet data can help everyone improve their knowledge of the sport, and possibly even improve the sport as a whole!
During this project, we will explore the key stages every data science project should include.
In this section, I will use the data set from Open Powerlifting (https://www.openpowerlifting.org/). It is extensive and contains all the information needed for our analysis, so it will be the only data set used in this project. I will be using Python for all operations on the data. First, let's import the relevant Python libraries for this study!
# Essential libraries for all parts of the data science lifecycle
import pandas as pd
import numpy as np
# for plotting data and visualizations
import matplotlib.pyplot as plt
import seaborn as sns
# for the exploratory analysis and machine learning sections:
# preprocessing, model selection, models, and metrics
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor
I have downloaded the CSV file from Open Powerlifting and will now read it in using pandas. After this, we can clean up the data!
# reading CSV through pandas
df = pd.read_csv('powerlifting.csv', dtype=object)
# displaying Dataframe
df
| Name | Sex | Event | Equipment | Age | AgeClass | BirthYearClass | Division | BodyweightKg | WeightClassKg | ... | Tested | Country | State | Federation | ParentFederation | Date | MeetCountry | MeetState | MeetTown | MeetName | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alona Vladi | F | SBD | Raw | 33 | 24-34 | 24-39 | O | 58.3 | 60 | ... | Yes | Russia | NaN | GFP | NaN | 2019-05-11 | Russia | NaN | Bryansk | Open Tournament |
| 1 | Galina Solovyanova | F | SBD | Raw | 43 | 40-44 | 40-49 | M1 | 73.1 | 75 | ... | Yes | Russia | NaN | GFP | NaN | 2019-05-11 | Russia | NaN | Bryansk | Open Tournament |
| 2 | Daniil Voronin | M | SBD | Raw | 15.5 | 16-17 | 14-18 | T | 67.4 | 75 | ... | Yes | Russia | NaN | GFP | NaN | 2019-05-11 | Russia | NaN | Bryansk | Open Tournament |
| 3 | Aleksey Krasov | M | SBD | Raw | 35 | 35-39 | 24-39 | O | 66.65 | 75 | ... | Yes | Russia | NaN | GFP | NaN | 2019-05-11 | Russia | NaN | Bryansk | Open Tournament |
| 4 | Margarita Pleschenkova | M | SBD | Raw | 26.5 | 24-34 | 24-39 | O | 72.45 | 75 | ... | Yes | Russia | NaN | GFP | NaN | 2019-05-11 | Russia | NaN | Bryansk | Open Tournament |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2887194 | Inna Orobets | F | SBD | Single-ply | 45 | 45-49 | 40-49 | Open | 109.3 | 84+ | ... | Yes | Ukraine | NaN | IPF | IPF | 2016-11-14 | USA | NaN | Orlando | World Open Championships |
| 2887195 | Ágnes Szabó | F | SBD | Single-ply | 27.5 | 24-34 | 24-39 | Open | 113.22 | 84+ | ... | Yes | Hungary | NaN | IPF | IPF | 2016-11-14 | USA | NaN | Orlando | World Open Championships |
| 2887196 | Brenda van der Meulen | F | SBD | Single-ply | 37.5 | 35-39 | 24-39 | Open | 110.79 | 84+ | ... | Yes | Netherlands | NaN | IPF | IPF | 2016-11-14 | USA | NaN | Orlando | World Open Championships |
| 2887197 | Kelly Phasey | F | SBD | Single-ply | 39.5 | 35-39 | 40-49 | Open | 124.53 | 84+ | ... | Yes | England | NaN | IPF | IPF | 2016-11-14 | USA | NaN | Orlando | World Open Championships |
| 2887198 | Hildeborg Juvet Hugdal | F | SBD | Single-ply | 32.5 | 24-34 | 24-39 | Open | 123 | 84+ | ... | Yes | Norway | NaN | IPF | IPF | 2016-11-14 | USA | NaN | Orlando | World Open Championships |
2887199 rows × 41 columns
Normally, we would only display the head (first few rows) of a data set; however, I wanted to make note of how large this data set is: almost 3 million rows alongside 41 features! For anyone a bit new to powerlifting, or who doesn't know anything at all, here is some documentation explaining each of these 41 columns (https://openpowerlifting.gitlab.io/opl-csv/bulk-csv-docs.html).
As every data scientist knows, most of this information is not completely "clean" or needed for our analysis. That is, the data contains values we do not care about, is not uniform in type, and has a lot of missing entries that will make many operations harder or infeasible.
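To make "not clean" concrete, here is a tiny sketch with made-up values (not rows from the real data set) showing how mixed types and sentinel strings hide missing data until columns are coerced to numeric:

```python
import pandas as pd

# A small made-up frame illustrating the problems described above:
# mixed types, missing values, and text where numbers belong.
raw = pd.DataFrame({
    'Age': ['33', '43', None, 'not given'],
    'Best3SquatKg': ['112.5', None, '92.5', '-80'],
})
# Missing entries per column before any cleaning
print(raw.isna().sum().to_dict())      # {'Age': 1, 'Best3SquatKg': 1}
# Coercing to numeric turns unparseable strings into NaN, exposing more gaps
coerced = raw.apply(pd.to_numeric, errors='coerce')
print(coerced.isna().sum().to_dict())  # {'Age': 2, 'Best3SquatKg': 1}
```

This is exactly the kind of issue we will deal with at scale in the tidying step below.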
With that said, we can move on to Part 2 of the Data Collection phase: tidying the data! First, there are a few aspects of this data set that won't be necessary for the type of analysis I plan to do. Moreover, since the data set is so large, it will be helpful to decrease the number of rows. Most of the time we don't want to simply drop all rows that contain NaN (i.e., missing) values, but it is acceptable for something of this size. I also want to concentrate on two lifting federations: USAPL and IPF. These federations perform drug tests, which makes the playing field a bit more even for each entry, and they are fairly strict on refereeing, so lifts counted at their meets are likely to be very legitimate. Lastly, I want to focus on individuals who competed in the SBD event, which stands for Squat, Bench, and Deadlift. These are the 3 main lifts of powerlifting and are considered standard for any meet.
Now that we have some considerations in mind, let's do some processing!
# get rid of unnecessary columns
df = df[['Date', 'Age', 'Sex', 'Event', 'BirthYearClass', 'BodyweightKg', \
'Squat1Kg', 'Squat2Kg', 'Squat3Kg', 'Best3SquatKg', \
'Bench1Kg', 'Bench2Kg', 'Bench3Kg', 'Best3BenchKg', \
'Deadlift1Kg', 'Deadlift2Kg', 'Deadlift3Kg', \
'Best3DeadliftKg', 'TotalKg', 'Place', 'Dots', 'Wilks', 'Glossbrenner', \
'Tested', 'Federation']]
# drop all rows with Nan values since we only want to analyze people who completed all their lifts
# and have important data like their weight and age/birthyearclass
df = df.dropna()
# restrict to just USAPL and IPF federations
df = df[(df['Federation'] == 'USAPL') | (df['Federation'] == 'IPF')]
# restrict to SBD and Tested meets only
df = df[(df['Event'] == 'SBD') & (df['Tested'] == 'Yes')]
df
| Date | Age | Sex | Event | BirthYearClass | BodyweightKg | Squat1Kg | Squat2Kg | Squat3Kg | Best3SquatKg | ... | Deadlift2Kg | Deadlift3Kg | Best3DeadliftKg | TotalKg | Place | Dots | Wilks | Glossbrenner | Tested | Federation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 307302 | 2018-11-10 | 48 | F | SBD | 40-49 | 51.35 | 102.5 | 110 | 112.5 | 112.5 | ... | 122.5 | 127.5 | 127.5 | 292.5 | 1 | 359.67 | 368.19 | 326.53 | Yes | USAPL |
| 307304 | 2018-11-10 | 47 | F | SBD | 40-49 | 71.6 | 100 | 112.5 | 120 | 120 | ... | 142.5 | 150 | 150 | 325 | 1 | 324.56 | 318.39 | 280.50 | Yes | USAPL |
| 307305 | 2018-11-10 | 43.5 | F | SBD | 40-49 | 114.65 | 70 | 80 | 92.5 | 92.5 | ... | 127.5 | 140 | 140 | 315 | 1 | 256.19 | 254.00 | 215.78 | Yes | USAPL |
| 307306 | 2018-11-10 | 38.5 | F | SBD | 24-39 | 51.1 | 92.5 | 95 | 97.5 | 97.5 | ... | 117.5 | 125 | 125 | 275 | 1 | 339.31 | 347.46 | 308.21 | Yes | USAPL |
| 307307 | 2018-11-10 | 26.5 | F | SBD | 24-39 | 57 | 92.5 | 97.5 | 102.5 | 102.5 | ... | 112.5 | 115 | 115 | 277.5 | 1 | 317.93 | 322.01 | 284.73 | Yes | USAPL |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2887193 | 2016-11-14 | 32 | F | SBD | 24-39 | 95 | 250 | 255 | 260 | 260 | ... | 222.5 | 227.5 | 227.5 | 647.5 | 2 | 564.05 | 548.04 | 473.70 | Yes | IPF |
| 2887194 | 2016-11-14 | 45 | F | SBD | 40-49 | 109.3 | 235 | -242.5 | 242.5 | 242.5 | ... | 210 | 215 | 215 | 645 | 3 | 532.83 | 525.14 | 447.12 | Yes | IPF |
| 2887195 | 2016-11-14 | 27.5 | F | SBD | 24-39 | 113.22 | -237.5 | 237.5 | -247.5 | 237.5 | ... | 200 | -210 | 200 | 632.5 | 4 | 516.46 | 511.26 | 434.61 | Yes | IPF |
| 2887196 | 2016-11-14 | 37.5 | F | SBD | 24-39 | 110.79 | 210 | 222.5 | 227.5 | 227.5 | ... | 190 | -195 | 190 | 617.5 | 5 | 507.79 | 501.32 | 426.60 | Yes | IPF |
| 2887197 | 2016-11-14 | 39.5 | F | SBD | 40-49 | 124.53 | 210 | 220 | -227.5 | 220 | ... | -155 | -155 | 145 | 515 | 6 | 409.13 | 409.13 | 345.98 | Yes | IPF |
176734 rows × 25 columns
Again, I show more than just the head of the dataframe so we can see how many rows we are now dealing with: about 180,000 rows with 25 features each. Still a lot of data to work with! After identifying the features we wish to use, we need to tidy the data a bit more. This mostly means typecasting each column where necessary. We also drop any values that could not be typecast or were ambiguous in a way that could impact later sections of this project. For example, negative values for lifts denote failed attempts, and I only want to consider people who succeeded on all three attempts to make processing the data easier; there are also weight-class values that specify only a minimum weight (denoted X+ for some weight X), and more. Factoring all this in, let's write some code to do this preprocessing!
# convert Date to datetime; invalid dates become NaT
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
# cast Age as float; invalid values become NaN
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
# want to make the distinction between Male and Female only
df.loc[df['Sex'] == 'Mx', 'Sex'] = np.nan
lifts = ['Squat1Kg', 'Squat2Kg', 'Squat3Kg', 'Best3SquatKg', \
         'Bench1Kg', 'Bench2Kg', 'Bench3Kg', 'Best3BenchKg', \
         'Deadlift1Kg', 'Deadlift2Kg', 'Deadlift3Kg', \
         'Best3DeadliftKg', 'TotalKg']
# cast all lift values as floats and get rid of invalid data
# (non-positive values denote failed attempts)
for lift in lifts:
    df[lift] = pd.to_numeric(df[lift], errors='coerce')
    df.loc[df[lift] <= 0, lift] = np.nan
# cast places as numbers and get rid of invalid data (e.g. 'DQ')
df['Place'] = pd.to_numeric(df['Place'], errors='coerce')
df.loc[df['Place'] <= 0, 'Place'] = np.nan
# cast all scores as floats and get rid of invalid data
for score in ['Dots', 'Wilks', 'Glossbrenner']:
    df[score] = pd.to_numeric(df[score], errors='coerce')
    df.loc[df[score] <= 0, score] = np.nan
cleanedData = df.dropna()
# We fixed the event to be SBD and the Federations to only be USAPL and IPF which are tested federations
cleanedData = cleanedData.drop(['Event', 'Tested', 'Federation'], axis=1)
cleanedData
| Date | Age | Sex | BirthYearClass | BodyweightKg | Squat1Kg | Squat2Kg | Squat3Kg | Best3SquatKg | Bench1Kg | ... | Best3BenchKg | Deadlift1Kg | Deadlift2Kg | Deadlift3Kg | Best3DeadliftKg | TotalKg | Place | Dots | Wilks | Glossbrenner | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 307305 | 2018-11-10 00:00:00 | 43.5 | F | 40-49 | 114.65 | 70.0 | 80.0 | 92.5 | 92.5 | 65.0 | ... | 82.5 | 110.0 | 127.5 | 140.0 | 140.0 | 315.0 | 1 | 256.19 | 254.0 | 215.78 |
| 307309 | 2018-11-10 00:00:00 | 24.5 | F | 24-39 | 58.95 | 102.5 | 110.0 | 115.0 | 115.0 | 57.5 | ... | 62.5 | 127.5 | 135.0 | 142.5 | 142.5 | 320.0 | 2 | 358.73 | 361.69 | 319.65 |
| 307314 | 2018-11-10 00:00:00 | 24.5 | F | 24-39 | 71 | 110.0 | 117.5 | 122.5 | 122.5 | 67.5 | ... | 75.0 | 137.5 | 147.5 | 155.0 | 155.0 | 352.5 | 3 | 353.66 | 347.3 | 306.03 |
| 307318 | 2018-11-10 00:00:00 | 30.5 | F | 24-39 | 97.15 | 150.0 | 165.0 | 172.5 | 172.5 | 65.0 | ... | 77.5 | 165.0 | 175.0 | 187.5 | 187.5 | 437.5 | 1 | 377.64 | 367.52 | 316.84 |
| 307319 | 2018-11-10 00:00:00 | 43.5 | F | 40-49 | 114.65 | 70.0 | 80.0 | 92.5 | 92.5 | 65.0 | ... | 82.5 | 110.0 | 127.5 | 140.0 | 140.0 | 315.0 | 2 | 256.19 | 254.0 | 215.78 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2886975 | 2022-08-27 00:00:00 | 18.5 | F | 19-23 | 69.9 | 155.0 | 165.0 | 172.5 | 172.5 | 75.0 | ... | 82.5 | 140.0 | 147.5 | 150.0 | 150.0 | 405.0 | 3 | 409.89 | 403.31 | 355.52 |
| 2886978 | 2022-08-27 00:00:00 | 20.5 | F | 19-23 | 82.3 | 150.0 | 165.0 | 175.0 | 175.0 | 100.0 | ... | 112.5 | 130.0 | 140.0 | 150.0 | 150.0 | 437.5 | 3 | 406.53 | 394.24 | 345.14 |
| 2886998 | 2016-11-14 00:00:00 | 37.5 | M | 24-39 | 65.8 | 250.0 | 260.0 | 262.5 | 262.5 | 190.0 | ... | 200.0 | 275.0 | 287.5 | 300.0 | 300.0 | 762.5 | 3 | 598.79 | 600.21 | 583.31 |
| 2887072 | 2016-11-14 00:00:00 | 31.0 | M | 24-39 | 118.25 | 400.0 | 415.0 | 422.5 | 422.5 | 310.0 | ... | 327.5 | 347.5 | 360.0 | 375.0 | 375.0 | 1125.0 | 1 | 649.29 | 649.05 | 621.82 |
| 2887115 | 2016-11-14 00:00:00 | 39.5 | F | 40-49 | 51.59 | 155.0 | 165.0 | 175.0 | 175.0 | 90.0 | ... | 100.0 | 150.0 | 160.0 | 165.0 | 165.0 | 440.0 | 6 | 539.28 | 551.88 | 489.34 |
25616 rows × 22 columns
Fantastic! Our data is cleaned and we are still left with around 25,000 rows, which is more than enough for analysis. We were lucky that Open Powerlifting keeps a fairly well-maintained data set, which made this section a bit easier than in many data science projects. Below is some code I wrote to save this cleaned-up data to a CSV file, which saved me from rereading the large original data set after every kernel restart. It is commented out as it does not need to be run.
#cleanedData = cleanedData.astype({'Date': 'datetime64[ns]', 'Age': 'float64', 'BodyweightKg': 'float64', \
# 'Squat1Kg': 'float64', 'Squat2Kg': 'float64', 'Squat3Kg': 'float64', 'Best3SquatKg': 'float64',\
# 'Bench1Kg': 'float64', 'Bench2Kg': 'float64', 'Bench3Kg': 'float64', 'Best3BenchKg': 'float64',\
# 'Deadlift1Kg': 'float64', 'Deadlift2Kg': 'float64', 'Deadlift3Kg': 'float64', 'Best3DeadliftKg': 'float64',\
# 'TotalKg': 'float64', 'Place': 'int', 'Dots': 'float64', 'Wilks': 'float64', 'Glossbrenner': 'float64'})
#cleanedData.dtypes
# save cleaned data so that I can do analysis without having to rerun all above cells on a kernel restart
#cleanedData.to_csv('cleaned_data.csv', index=False)
In this section of the data science lifecycle, we aim to graph and visualize our data set to look for trends, possible correlations, and factors that may or may not matter for our analysis and models in later sections. From these graphs, we can begin to form ideas about what parameters we want for our models and what might be interesting to explore, and get a better understanding of the data as a whole! Let's start by reading in the cleaned-up CSV file.
# reading CSV through pandas
df = pd.read_csv('cleaned_data.csv')
# displaying head of Dataframe
df.head()
| Date | Age | Sex | BirthYearClass | BodyweightKg | Squat1Kg | Squat2Kg | Squat3Kg | Best3SquatKg | Bench1Kg | ... | Best3BenchKg | Deadlift1Kg | Deadlift2Kg | Deadlift3Kg | Best3DeadliftKg | TotalKg | Place | Dots | Wilks | Glossbrenner | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-11-10 | 43.5 | F | 40-49 | 114.65 | 70.0 | 80.0 | 92.5 | 92.5 | 65.0 | ... | 82.5 | 110.0 | 127.5 | 140.0 | 140.0 | 315.0 | 1 | 256.19 | 254.00 | 215.78 |
| 1 | 2018-11-10 | 24.5 | F | 24-39 | 58.95 | 102.5 | 110.0 | 115.0 | 115.0 | 57.5 | ... | 62.5 | 127.5 | 135.0 | 142.5 | 142.5 | 320.0 | 2 | 358.73 | 361.69 | 319.65 |
| 2 | 2018-11-10 | 24.5 | F | 24-39 | 71.00 | 110.0 | 117.5 | 122.5 | 122.5 | 67.5 | ... | 75.0 | 137.5 | 147.5 | 155.0 | 155.0 | 352.5 | 3 | 353.66 | 347.30 | 306.03 |
| 3 | 2018-11-10 | 30.5 | F | 24-39 | 97.15 | 150.0 | 165.0 | 172.5 | 172.5 | 65.0 | ... | 77.5 | 165.0 | 175.0 | 187.5 | 187.5 | 437.5 | 1 | 377.64 | 367.52 | 316.84 |
| 4 | 2018-11-10 | 43.5 | F | 40-49 | 114.65 | 70.0 | 80.0 | 92.5 | 92.5 | 65.0 | ... | 82.5 | 110.0 | 127.5 | 140.0 | 140.0 | 315.0 | 2 | 256.19 | 254.00 | 215.78 |
5 rows × 22 columns
It is standard for powerlifting meets to group individuals by body weight class so that people of similar size compete against each other. This helps level the playing field, since placement is based on the total weight lifted (the TotalKg column), computed by summing the best of the 3 attempts for each of Squat, Bench, and Deadlift. Some meet-wide awards are also given out based on strength metrics like the Wilks score, Dots score, and Glossbrenner score, which aim to identify the lifter who is statistically the "strongest". Let's create a weight class column and look at the distribution of the data set across classes. Since males and females will likely vary drastically in weight, let's make a separate plot for each.
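Before moving on, a quick sanity check on the definition just given: TotalKg should be the sum of the best successful attempt in each of the three lifts. A toy example with made-up numbers (not rows from the real data set):

```python
import pandas as pd

# Toy frame with made-up best lifts (kg), not rows from the real data set
sample = pd.DataFrame({
    'Best3SquatKg':    [112.5, 120.0],
    'Best3BenchKg':    [52.5, 55.0],
    'Best3DeadliftKg': [127.5, 150.0],
})
# TotalKg is the sum of the best successful attempt of each lift
sample['TotalKg'] = sample[['Best3SquatKg', 'Best3BenchKg',
                            'Best3DeadliftKg']].sum(axis=1)
print(sample['TotalKg'].tolist())  # [292.5, 325.0]
```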
# Create Weight Classes
bins = [0, 52, 57, 63, 72, 84, 93, 105, 120, 1200]
labels = ['47kg', '52kg', '57kg', '63kg', '72kg', '84kg', '93kg', '105kg', '120kg+']
df['WeightClass'] = pd.cut(df['BodyweightKg'], bins=bins, labels=labels)
# Filter observations by gender
male_df = df[df['Sex'] == 'M']
female_df = df[df['Sex'] == 'F']
# Get count of observations by weight class for each gender
male_weightclass_counts = male_df['WeightClass'].value_counts()
female_weightclass_counts = female_df['WeightClass'].value_counts()
# Weight class labels are already in ascending order
sorted_weightclasses = labels
# Plot count of observations by weight class for males
plt.figure(figsize=(10, 4)) # Adjust the figure size as needed
plt.subplot(1, 2, 1)
male_weightclass_counts.loc[sorted_weightclasses].plot(kind='bar')
plt.title('Count of Observations by Weight Class (Male)')
plt.xlabel('Weight Class')
plt.ylabel('Count')
# Plot count of observations by weight class for females
plt.subplot(1, 2, 2)
female_weightclass_counts.loc[sorted_weightclasses].plot(kind='bar')
plt.title('Count of Observations by Weight Class (Female)')
plt.xlabel('Weight Class')
plt.ylabel('Count')
plt.tight_layout() # Adjust spacing between subplots if needed
plt.show()
It appears there is a decent amount more data for males than females, with males skewing toward the heavier weight classes and females toward the lighter ones. Still, there is a fairly good amount of data for each weight class, which is good to know. Now let's look at the distribution of ages seen at powerlifting meets against the total weight lifted. This can help us see how much age factors into strength. Again, we will create separate graphs for males and females to continue checking for discrepancies between the sexes.
# create ordered categories for birth year class
ordered_birthyear = pd.Categorical(df['BirthYearClass'],
categories=['14-18', '19-23', '24-39', '40-49', '50-59', '60-69', '70-999'],
ordered=True)
# create a copy of the dataframe with ordered birth year class
df_ordered = df.copy()
df_ordered['BirthYearClass'] = ordered_birthyear
df_ordered = df_ordered.sort_values(by='BirthYearClass')
# create separate plots for males and females
male_df = df_ordered[df_ordered['Sex'] == 'M']
female_df = df_ordered[df_ordered['Sex'] == 'F']
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(16, 6))
# plot for males
sns.boxplot(x='BirthYearClass', y='TotalKg', data=male_df, ax=ax1)
ax1.set_title('Males')
# plot for females
sns.boxplot(x='BirthYearClass', y='TotalKg', data=female_df, ax=ax2)
ax2.set_title('Females')
plt.tight_layout()
plt.show()
Now this is a bit more interesting! We see a bit of an inverted-U shape across the age groups, with a peak at the 24-39 group. Furthermore, the general trend for males and females seems almost exactly the same, which likely means that sex does not change how totals vary with age. However, this boxplot does lead us to consider age as a factor in an individual's strength. Before we make any real claims, the strength of this relationship will need to be mathematically analyzed later in this project; for now, we can keep it in the back of our minds. In future visualizations, we may also find it useful to check whether trends change when factoring in age. Another interesting relationship to explore is how much body weight factors into one's strength. As discussed earlier, there are different strength metrics such as the Wilks, Dots, and Glossbrenner scores. Wilks is historically the most common, but let's use the Dots score here and plot it against body weight for males and females.
# for future use when wanting to compare males and females
male_lifters = df[df['Sex'] == 'M']
female_lifters = df[df['Sex'] == 'F']
# Create subplots
fig, axes = plt.subplots(ncols=2, figsize=(15, 6))
# Plot male data
axes[0].scatter(male_lifters['BodyweightKg'], male_lifters['Dots'])
axes[0].set_xlabel('Bodyweight (kg)')
axes[0].set_ylabel('Dots Score')
axes[0].set_title('Male Lifters')
# Plot female data
axes[1].scatter(female_lifters['BodyweightKg'], female_lifters['Dots'])
axes[1].set_xlabel('Bodyweight (kg)')
axes[1].set_ylabel('Dots Score')
axes[1].set_title('Female Lifters')
y_ticks = [0, 100, 200, 300, 400, 500, 600, 700]
axes[0].set_yticks(y_ticks)
axes[1].set_yticks(y_ticks)
plt.tight_layout()
plt.show()
From these plots, the male and female distributions look fairly similar, and there does not seem to be much of a solid trend. There may be a slightly larger cluster of higher Dots scores at lighter body weights, but these plots are not very convincing of such a relationship. Still, it was insightful to explore! Another possibly useful relationship to look into is whether early attempts in each of the Squat, Bench, and Deadlift trend with the total weight achieved from the best lifts summed together. There is a strong suspicion that some relationship should form here, but what kind? Will it be linear? Quadratic? Something else? Let's plot the second attempt of each lift against the total, for males and females, to see what the answer may be.
fig, axs = plt.subplots(2, 3, figsize=(20, 10))
fig.subplots_adjust(hspace=0.5)
for i, sex in enumerate(['M', 'F']):
df_sex = df[df['Sex'] == sex]
for j, lift in enumerate(['Squat', 'Bench', 'Deadlift']):
col_lift = lift + '2Kg'
col_total = 'TotalKg'
axs[i, j].scatter(df_sex[col_lift], df_sex[col_total], alpha=0.5)
axs[i, j].set_title(sex + ' ' + lift)
axs[i, j].set_xlabel('Second Attempt ' + lift)
axs[i, j].set_ylabel('Total')
plt.show()
From these plots, squat (for both males and females) shows a very linear trend, with bench and deadlift also showing promising signs of linearity. Bench seems to grow a bit faster and deadlift a bit more gradually, as there is a little curvature to those trends.
However, this may have been somewhat expected. Most people come into a powerlifting meet knowing where they wish to top each lift off, and they warm up before the first attempt. Generally, the early attempts are relatively close to that goal but are weights they "know" they can successfully pull off. With this in mind, these trends are not too surprising, yet it was still important to graph them to make sure.
We notice that males and females don't show much discrepancy in their trends. What if we factor in age? Let's break these plots up by sex and age group to see whether the trends persist.
# group by sex and birth year class
grouped = df.groupby(['Sex', 'BirthYearClass'])
# iterate over groups and plot correlations
for (sex, birthyearclass), data in grouped:
    fig, axs = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))
    axs[0].scatter(data['Bench1Kg'], data['TotalKg'], alpha=0.5)
    axs[0].set_xlabel('Bench 1st Attempt (kg)')
    axs[0].set_ylabel('Total Weight (kg)')
    axs[1].scatter(data['Squat1Kg'], data['TotalKg'], alpha=0.5)
    axs[1].set_xlabel('Squat 1st Attempt (kg)')
    axs[1].set_ylabel('Total Weight (kg)')
    axs[2].scatter(data['Deadlift1Kg'], data['TotalKg'], alpha=0.5)
    axs[2].set_xlabel('Deadlift 1st Attempt (kg)')
    axs[2].set_ylabel('Total Weight (kg)')
    fig.suptitle(f'{sex} {birthyearclass} - Correlation of First Attempts and Total Weight')
    plt.tight_layout()
    plt.show()
Earlier we saw that age may play a role in one's strength, but in these plots of opening attempts broken out by age and sex, the trends do persist. Although some age groups don't have much data, it is still fairly convincing. Hence, the starting point most individuals choose, based on the top lifts they likely set out to reach, is mostly consistent for everyone. Therefore, we most likely will not need to consider sex and age when looking for correlations between opening attempts and total weight achieved. Quite interesting!
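When we analyze this more formally later, one way to make "the trend persists across groups" concrete is to compare per-group correlations between an opener and the total. A minimal sketch on synthetic data (all numbers below are invented for illustration, not taken from the meet data):

```python
import numpy as np
import pandas as pd

# Hypothetical sketch: quantify "the trend persists across groups" by
# comparing per-group correlations between a squat opener and the total.
# All values here are synthetic, generated only to illustrate the check.
rng = np.random.default_rng(1)
n = 300
demo = pd.DataFrame({
    'Sex': rng.choice(['M', 'F'], n),
    'Squat1Kg': rng.uniform(60, 300, n),
})
# Totals built as roughly 3.2x the squat opener plus noise, for both sexes
demo['TotalKg'] = 3.2 * demo['Squat1Kg'] + rng.normal(0, 40, n)

# Pearson correlation within each sex; similar values across groups would
# support not splitting later models by sex
corrs = {sex: g['Squat1Kg'].corr(g['TotalKg']) for sex, g in demo.groupby('Sex')}
print(all(c > 0.9 for c in corrs.values()))  # True
```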
This does raise some new, perhaps more interesting questions. Are there general trends in going from one's first attempt to the second? What about second to third? Do bench, squat, and deadlift show similar trends here? What other factors might contribute to these "jumps" (age, sex, weight, etc.)? Let's explore this by creating features based on the percent difference between each pair of subsequent attempts for each of the SBD lifts. We'll compare weight classes first to see what insights we can draw.
# Calculate percentage difference for attempts 1 to 2 and 2 to 3
df['Bench12PercentDiff'] = ((df['Bench2Kg'] - df['Bench1Kg']) / df['Bench1Kg']) * 100
df['Bench23PercentDiff'] = ((df['Bench3Kg'] - df['Bench2Kg']) / df['Bench2Kg']) * 100
df['Squat12PercentDiff'] = ((df['Squat2Kg'] - df['Squat1Kg']) / df['Squat1Kg']) * 100
df['Squat23PercentDiff'] = ((df['Squat3Kg'] - df['Squat2Kg']) / df['Squat2Kg']) * 100
df['Deadlift12PercentDiff'] = ((df['Deadlift2Kg'] - df['Deadlift1Kg']) / df['Deadlift1Kg']) * 100
df['Deadlift23PercentDiff'] = ((df['Deadlift3Kg'] - df['Deadlift2Kg']) / df['Deadlift2Kg']) * 100
# Group the data by lift and calculate mean percentage difference for each lift
lift_groups = df.groupby('WeightClass')[['Bench12PercentDiff', 'Bench23PercentDiff', 'Squat12PercentDiff', 'Squat23PercentDiff', 'Deadlift12PercentDiff', 'Deadlift23PercentDiff']]
mean_percent_diff = lift_groups.mean()
mean_percent_diff
| Bench12PercentDiff | Bench23PercentDiff | Squat12PercentDiff | Squat23PercentDiff | Deadlift12PercentDiff | Deadlift23PercentDiff | |
|---|---|---|---|---|---|---|
| WeightClass | ||||||
| 47kg | 9.217058 | 6.686350 | 8.525675 | 6.063309 | 8.412385 | 6.047253 |
| 52kg | 8.637485 | 6.005720 | 7.613078 | 5.711908 | 8.041704 | 5.826675 |
| 57kg | 8.309339 | 5.893291 | 7.440370 | 5.652076 | 7.760215 | 5.851623 |
| 63kg | 8.021261 | 5.614443 | 7.222643 | 5.475819 | 7.480750 | 5.596313 |
| 72kg | 7.252966 | 5.068700 | 6.784184 | 5.124431 | 7.141596 | 5.246906 |
| 84kg | 6.957221 | 4.861000 | 6.741862 | 5.020052 | 7.114098 | 5.193180 |
| 93kg | 6.789667 | 4.866404 | 6.595476 | 5.010194 | 7.029945 | 5.096721 |
| 105kg | 7.081237 | 4.910506 | 6.737796 | 5.107322 | 7.068021 | 5.224477 |
| 120kg+ | 6.875221 | 5.079433 | 6.754716 | 5.135136 | 7.075729 | 5.154207 |
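One way to read this table: chaining the two mean jumps together implies how heavy the typical opener is relative to the third attempt. Using round numbers close to the table above (about a 7% jump from attempt 1 to 2 and 5% from 2 to 3):

```python
# Mean jumps of roughly 7% (attempt 1 to 2) and 5% (attempt 2 to 3),
# as in the table above, imply an opener of about 89% of the third attempt.
jump_1_to_2 = 0.07
jump_2_to_3 = 0.05
opener_fraction = 1 / ((1 + jump_1_to_2) * (1 + jump_2_to_3))
print(round(opener_fraction, 3))  # 0.89
```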
This data is useful but it's still a bit hard to see if there are differences. Let's get a better visual.
# Create bar plots of mean percentage difference for each lift
lifts = ['Bench', 'Squat', 'Deadlift']
lift_diffs = ['12PercentDiff', '23PercentDiff']
fig, axs = plt.subplots(len(lifts), len(lift_diffs), figsize=(12, 8))
y_ticks = [0,1,2,3,4,5,6,7,8,9,10]
for i, lift in enumerate(lifts):
for j, diff in enumerate(lift_diffs):
col_name = f"{lift}{diff}"
axs[i, j].bar(mean_percent_diff.index, mean_percent_diff[col_name])
axs[i, j].set_title(f"{lift} {diff} mean percentage difference")
axs[i, j].set_xlabel('Weight class')
axs[i, j].set_ylabel('Percentage difference')
axs[i, j].set_yticks(y_ticks)
fig.tight_layout()
plt.show()
So there seem to be slightly higher percent differences for lighter individuals. Next, let's group by birth year class (i.e., ages grouped together) and see if anything interesting is going on there.
lifts = ['Squat', 'Bench', 'Deadlift']
lift_diff_cols = ['Squat12PercentDiff', 'Squat23PercentDiff', 'Bench12PercentDiff', 'Bench23PercentDiff', 'Deadlift12PercentDiff', 'Deadlift23PercentDiff']
# titles ordered to match the Squat, Bench, Deadlift loop order below
titles = ["Squat Attempt 1 to 2 % Diff vs Total (by Birth Class)",
          "Squat Attempt 2 to 3 % Diff vs Total (by Birth Class)",
          "Bench Attempt 1 to 2 % Diff vs Total (by Birth Class)",
          "Bench Attempt 2 to 3 % Diff vs Total (by Birth Class)",
          "Deadlift Attempt 1 to 2 % Diff vs Total (by Birth Class)",
          "Deadlift Attempt 2 to 3 % Diff vs Total (by Birth Class)"]
x_titles = ["Squat Attempt 1 to 2 Percent Difference",
            "Squat Attempt 2 to 3 Percent Difference",
            "Bench Attempt 1 to 2 Percent Difference",
            "Bench Attempt 2 to 3 Percent Difference",
            "Deadlift Attempt 1 to 2 Percent Difference",
            "Deadlift Attempt 2 to 3 Percent Difference"]
total_col = 'TotalKg'
byc_col = 'BirthYearClass'
fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(15,12))
fig.subplots_adjust(hspace=2)
for i, lift in enumerate(lifts):
lift_diff_cols_for_lift = [col for col in lift_diff_cols if lift in col]
for j, diff_col in enumerate(lift_diff_cols_for_lift):
# Create a scatter plot of Total vs percent difference, split by birth year class
ax = axs[i,j]
for byc in df[byc_col].unique():
mask = df[byc_col] == byc
x = df.loc[mask, diff_col]
y = df.loc[mask, total_col]
ax.scatter(x, y, label=byc)
ax.set_xlabel(x_titles[i*2+j])
ax.set_ylabel(total_col)
ax.set_title(titles[i*2+j])
ax.set_xticks(range(0, 226, 25))
ax.legend(title=byc_col, loc='upper right')
plt.tight_layout()
plt.show()
It seems that some outliers are stretching the graphs, which makes real trends hard to see. Let's cap the x-axis at a 30% increase.
fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(15, 12))
fig.subplots_adjust(hspace=2)
for i, lift in enumerate(lifts):
    lift_diff_cols_for_lift = [col for col in lift_diff_cols if lift in col]
    for j, diff_col in enumerate(lift_diff_cols_for_lift):
        # Create a scatter plot of Total vs percent difference, split by birth year class
        ax = axs[i, j]
        for byc in df[byc_col].unique():
            mask = df[byc_col] == byc
            x = df.loc[mask, diff_col]
            y = df.loc[mask, total_col]
            ax.scatter(x, y, label=byc)
        ax.set_xlabel(x_titles[i * 2 + j])
        ax.set_ylabel(total_col)
        ax.set_title(titles[i * 2 + j])
        # Cap the x-axis at 30% to keep outliers from stretching the plot
        ax.set_xlim(0, 30)
        ax.set_xticks(range(0, 31, 5))
        ax.legend(title=byc_col, loc='upper right')
plt.tight_layout()
plt.show()
Now that's a lot better. From these graphs, each of the age groups appears to follow a similar trend. Moreover, the trend seems to deviate slightly from linear toward exponential (i.e., an inverse logarithm). This is quite interesting! Furthermore, the trends for all three lifts, for both the attempt 1 to 2 and 2 to 3 jumps, share a similar-looking pattern. This may allow us to build a more meaningful relationship between attempt jumps and total weight lifted!
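To illustrate what an exponential relationship between jump size and total would look like, here is a minimal NumPy-only sketch on synthetic data (the constants 800 and 0.1 are made up, not fit from the real meet data): taking the log of the total turns an exponential trend into a straight line, which a simple least-squares fit can recover.

```python
import numpy as np

# Synthetic stand-in: totals that fall off exponentially as the percent
# jump between attempts grows, with a little multiplicative noise.
rng = np.random.default_rng(0)
x = np.linspace(1, 30, 100)                       # percent difference
y = 800 * np.exp(-0.1 * x) * np.exp(rng.normal(0, 0.01, x.size))  # total kg

# If y = a * exp(-b * x), then log(y) = log(a) - b * x, so fitting a
# straight line to (x, log(y)) recovers the exponential's parameters.
slope, intercept = np.polyfit(x, np.log(y), 1)
a, b = np.exp(intercept), -slope
print(f"recovered a = {a:.1f}, b = {b:.3f}")
```

On the real data, a residual plot after such a log-linear fit would be one way to check whether the exponential shape actually holds.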
It seems that we have visualized and explained enough trends at this point to try and make some models. Let's dive into the modeling and machine learning aspect of this project.
First, I dropped some columns we did not use when creating trends in the last section. Then, I again wrote the data out to a CSV file for this section of the project; the code is shown below but commented out since it only needs to run once.
#new_df = df.drop(['Date','Glossbrenner', 'Dots', 'Place'], axis=1)
#new_df.to_csv('ML_data.csv', index=False)
# reading CSV through pandas
df = pd.read_csv('ML_data.csv')
# displaying head of Dataframe
df.head()
| Age | Sex | BirthYearClass | BodyweightKg | Squat1Kg | Squat2Kg | Squat3Kg | Best3SquatKg | Bench1Kg | Bench2Kg | ... | Best3DeadliftKg | TotalKg | Wilks | WeightClass | Bench12PercentDiff | Bench23PercentDiff | Squat12PercentDiff | Squat23PercentDiff | Deadlift12PercentDiff | Deadlift23PercentDiff | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 43.5 | F | 40-49 | 114.65 | 70.0 | 80.0 | 92.5 | 92.5 | 65.0 | 75.0 | ... | 140.0 | 315.0 | 254.00 | 105kg | 15.384615 | 10.000000 | 14.285714 | 15.625000 | 15.909091 | 9.803922 |
| 1 | 24.5 | F | 24-39 | 58.95 | 102.5 | 110.0 | 115.0 | 115.0 | 57.5 | 60.0 | ... | 142.5 | 320.0 | 361.69 | 57kg | 4.347826 | 4.166667 | 7.317073 | 4.545455 | 5.882353 | 5.555556 |
| 2 | 24.5 | F | 24-39 | 71.00 | 110.0 | 117.5 | 122.5 | 122.5 | 67.5 | 72.5 | ... | 155.0 | 352.5 | 347.30 | 63kg | 7.407407 | 3.448276 | 6.818182 | 4.255319 | 7.272727 | 5.084746 |
| 3 | 30.5 | F | 24-39 | 97.15 | 150.0 | 165.0 | 172.5 | 172.5 | 65.0 | 72.5 | ... | 187.5 | 437.5 | 367.52 | 93kg | 11.538462 | 6.896552 | 10.000000 | 4.545455 | 6.060606 | 7.142857 |
| 4 | 43.5 | F | 40-49 | 114.65 | 70.0 | 80.0 | 92.5 | 92.5 | 65.0 | 75.0 | ... | 140.0 | 315.0 | 254.00 | 105kg | 15.384615 | 10.000000 | 14.285714 | 15.625000 | 15.909091 | 9.803922 |
5 rows × 25 columns
Before we get into modeling the data, I want to explain some metrics that will be printed out after training each model, so we can analyze how well each model fit the data. There are 4 main values I show for each model: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared (R^2). These are explained below.
Interpretations:
Mean Squared Error (MSE): This measures the average squared difference between the predicted values and the actual values.
Root Mean Squared Error (RMSE): This is the square root of the MSE and is more interpretable as it is in the same units as the target variable.
Mean Absolute Error (MAE): This measures the average absolute difference between the predicted values and the actual values. MAE is less sensitive to outliers compared to MSE.
R-squared (R^2): This measures the proportion of the variance in the target variable that is explained by the independent variables in the model. However, a high R^2 does not necessarily mean that the model is a good fit as it can still overfit the data.
Explanation of values:
It is important to note that we have to bring these values into the context of each of our models so we can understand what they truly mean.
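The four metrics above are straightforward to compute directly from a model's predictions. Here is a small, self-contained NumPy sketch (the function name `regression_metrics` and the example numbers are illustrative, not from the models trained later):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, and R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    mse = np.mean(resid ** 2)          # average squared error
    rmse = np.sqrt(mse)                # same units as the target
    mae = np.mean(np.abs(resid))       # less sensitive to outliers
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot           # fraction of variance explained
    return mse, rmse, mae, r2

# Example: three hypothetical TotalKg values and predictions
mse, rmse, mae, r2 = regression_metrics([300, 400, 500], [310, 390, 505])
print(f"MSE={mse:.2f}, RMSE={rmse:.2f}, MAE={mae:.2f}, R^2={r2:.4f}")
```

Note how RMSE and MAE read in kilograms here, while MSE is in squared kilograms; that difference in units is why RMSE is usually the easier one to interpret.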